Framework - methods
Our framework is implemented in NetLogo 6.1.1 (https://ccl.northwestern.edu/netlogo/); the experiments were run on an Intel Core i7-8700 system with 32 GB RAM under Ubuntu 18.04, and the results were analysed using R 3.6.3/RStudio 1.4 on machines running Ubuntu 18.04 and macOS 10.15 (Catalina). The full source code and results are available on GitHub (mathjoss/bayes-in-network).
Our simulation framework is based on previously published models (Dediu, 2008, 2009) and has three main components: the language, the agents, and the communicative network. The language is modelled here as composed of one (or more) binary features that are obligatorily expressed in each individual utterance produced or perceived by the agents. We may think of these abstract features as representing, for instance, the use of the alveolar trill /r/ (value 1) versus a different r-like sound (value 0), the use of pitch to make a linguistic distinction (1) or not (0), having a subject-verb word order (1) or a verb-subject order (0), making a gender distinction (1) or not (0), using center embedding (1) or not (0), or any number of other such alternatives. Thus, taking the /r/ interpretation, a set of utterances {1,1,1} might be produced by an agent that can trill without issues, {0,0,0} by one that cannot, and {1,0,1} by an agent that either does not make the distinction or whose ability to trill is affected by other factors (e.g., socio-linguistic or co-articulatory). Each agent embodies three components: language acquisition, the internal representation of language, and the production of utterances. The first concerns the way observed data (in the form of “heard” utterances) affect (or not) the agent’s internal representation of language. The second is the manner in which the agent maintains this information about language. The third is the way the agent uses its internal representation of the language to produce actual utterances.
We opted here for a Bayesian model of language evolution, as introduced by Griffiths and Kalish (2007) and widely used in recent studies of language evolution and change (e.g., Dediu, 2008, 2009; Kirby, Dowman, & Griffiths, 2007, among others). To do so, we used agent-based modelling in NetLogo, where we created societies of agents sharing connections with each other. The NetLogo program is available on GitHub (mathjoss/bayes-in-network) and contains functionality not used in this analysis. To understand how to use our NetLogo code and parameters, please refer to Appendix: NetLogo guide.
As a general approach, it proposes that there is a universe of possible languages (discrete or continuous), \(h \in U\), and that an agent maintains at all times a probability distribution over these possible languages. Initially, before seeing any linguistic data, the agent has a prior distribution over these possible languages, \(p(h)\), and, as new data (in the form of observed utterances), \(d = \{u_{1}, u_{2}, … u_{n}\}\), come in, this probability is updated following Bayes’ rule, resulting in the posterior distribution: \[p(h|d) = \frac{p(d|h) \cdot p(h)}{p(d)}\] which reflects the agent’s new representation of the probability of each possible language \(h \in U\) after having heard the utterances composing the data \(d\). Here, \(p(d|h)\) is the likelihood that the observed data \(d\) were generated by language \(h\), and \(p(d)\) is a normalisation factor ensuring that \(p(h|d)\) is a probability between 0.0 and 1.0.
In this paper, we model a single binary feature and consequently the utterances, \(u\), collapse to a single bit of information, “0” or “1”. The observed data, \(d\), become binary strings, and one of the simplest models of language is that of throwing a (potentially unfair) coin that returns, with probability \(h \in [0,1]\), a “1” (otherwise, with probability \(1-h\), a “0”). Thus, the universe of our languages, \(h\), is the real number interval \(U = [0,1] \subset {\rm I\!R}\), and the likelihood of observing an utterance \(u \in \{ 0, 1 \}\) is given by the Bernoulli distribution with parameter \(h\); for a set of utterances \(d = \{u_{1}, u_{2}, … u_{n}\}\), the likelihood is given by the binomial distribution with parameters \(k = |\{u_{i}=1\}_{i=1..n}|\) (the number of utterances “1”), \(n\) (the total number of utterances), and \(h\): \(p(d|h) = Binomial(k,n,h) = \frac{n!}{k!(n-k)!}h^{k}(1-h)^{n-k}\), where \(x! = 1 \cdot 2 \cdot ... \cdot (x-1) \cdot x\); thus, we can reduce the set of utterances forming the data \(d\), without any loss of information, to the number of “1” utterances (\(k\)) and the total number of utterances (\(n\)). In Bayesian inference, one often uses the conjugate prior of a given likelihood; for the binomial likelihood, this is the Beta distribution, defined by two shape parameters, \(\alpha\) and \(\beta\), with probability density \(f(x,\alpha,\beta) = \frac{1}{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1}\), where the normalisation constant \(B(\alpha,\beta)\) ensures that the density integrates to 1 over \([0,1]\).
With these, the prior distribution of language \(h\) is \(f(h,\alpha_{0},\beta_{0})\), with the parameters \(\alpha_{0}\) and \(\beta_{0}\) defining the shape of this distribution (see below), and the posterior distribution, updated after seeing the data \(d=(k,n)\), is \(p(h|d) = f(h,\alpha_{1},\beta_{1})\), where \(\alpha_{1} = \alpha_{0} + k\) and \(\beta_{1} = \beta_{0} + (n-k)\); thus, the posterior is also Beta distributed, with the shape parameter \(\alpha\) “keeping track” of the “1” utterances and \(\beta\) of the “0” utterances, and the Bayesian updating reduces to simple (and very fast) arithmetic operations. When it comes to utterance production, a SAM agent chooses a value \(h \in [0,1]\) from the \(Beta(\alpha_{1},\beta_{1})\) distribution (i.e., proportionally to \(f(h,\alpha_{1},\beta_{1})\)), while a MAP agent picks the mode of the distribution, \(h_{M} = \frac{\alpha_{1}-1}{\alpha_{1}+\beta_{1}-2}\); afterwards, the agent uses this number between 0.0 and 1.0 as the parameter of a Bernoulli distribution (a coin throw) to produce a single “0” or “1” value with this probability – this value is the utterance that the agent produces.
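The conjugate update and the MAP mode described above reduce to a few arithmetic operations. A minimal Python sketch (the model itself is implemented in NetLogo; the function names here are ours, for illustration only):

```python
def update_beta(alpha0, beta0, k, n):
    """Posterior Beta shape parameters after observing k "1"s among n utterances."""
    return alpha0 + k, beta0 + (n - k)

def map_mode(alpha, beta):
    """Mode of Beta(alpha, beta), valid for alpha, beta > 1: the value a MAP agent picks."""
    return (alpha - 1) / (alpha + beta - 2)

# Example: the "unbiased" prior (alpha0 = beta0 = 2.2) updated with d = (k = 4, n = 4)
a1, b1 = update_beta(2.2, 2.2, k=4, n=4)
print(a1, b1, map_mode(a1, b1))
```

Note that no numerical integration is needed: the posterior is fully described by the two running counts, which is what makes very large simulations cheap.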
This choice (Bernoulli/Beta) does not necessarily reflect how data are used by real humans in learning a language, but it has several major advantages, most notably its simplicity, transparency, and computational efficiency, making it possible to run very large simulations on a consumer-grade computer in reasonable time (Dediu, 2009). Probably the most relevant advantage here is that the bias can be modeled only through the shape parameters of the prior Beta distribution, \(\alpha_{0}\) and \(\beta_{0}\), as the likelihood function is fixed to the binomial and utterance production offers only a limited choice between SAM and MAP. However, the Beta distribution is notoriously flexible, and can represent anything from (almost) flat (or uninformative) distributions to extremely peaked and “U”-shaped ones. Moreover, for unimodal cases, we can model not only the location of the peak (i.e., the “preferred” value), but also the spread of this peak (i.e., how “strong” this preference is; operationally, how much data is needed to change the preferred value); we actually describe the Beta distribution using these alternative parameters, the mode \(\mu\) (describing the “preferred” location) and the “spread” \(\lambda\), which are linked to the shape parameters \(\alpha\) and \(\beta\) (see Box 1). Thus, arguably, the Beta distribution is flexible enough to model relatively well an intuitive view of what such a bias might look like – not just a preferred value but also the strength of this preference. See Figure 1 for an example of how different prior distributions are updated upon seeing some data.
More details about this process are given in Strength and location of the bias.
Independent variables
In our NetLogo model, we used the following variables:

| Parameter | Variable | Dependencies | Description |
| --- | --- | --- | --- |
| Network size | size_net | none | The number of agents (i.e., nodes); it is fixed for a given run |
| Frequency of biased agents | prop_biased | none | The proportion of agents in the network that are biased; please note that here we consider networks containing a single type of biased agents |
| Bias location and strength | bias_strength | none | Only the strength value of biased agents varies; the values for unbiased agents are fixed at \(\mu_{0} = 0.5\), \(\lambda_{0} = 0.9\). See Strength and location of the bias |
| Proportion of highest-centrality agents that are biased | influencers_biased | depends on prop_biased | See Influencers biased |
| Utterance production mechanism | learners | none | See Learners |
| Network type | network | none | See Network type |
| Initial language | init_lang | none | The total number of utterances (\(n_0\)) and the number of utterances “1” (\(k_0\)) presented to all the agents in the network at the initial iteration \(i = 0\). See Initial language |
Set of combination 1 - analysis.csv
The parameters we used in set of combinations 1 are the following:

| Parameter | Values |
| --- | --- |
| size_net | 10 (“tiny”), 50 (“small”), 150 (“medium”), 500 (“large”), 1000 (“very large”) |
| prop_biased | 0% (“fully unbiased”), 10%, 30%, 50%, 100% (“fully biased”) |
| bias_strength | \(\mu_{0} = 0.1\), \(\lambda_{0} = 0.6\) (“biased flexible”); \(\mu_{0} = 0.1\), \(\lambda_{0} = 0.1\) (“biased rigid”) |
| influencers_biased | 0% (“random”), 10% (“biased influencers”) |
| learners | SAM (“sampler”), MAP (“a posteriori maximizer”) |
| network | Random, Scale-free, Small-world |
| init_lang | \(k_0 = 0\), \(n_0 = 0\) (“no initial language”); \(k_0 = 4\), \(n_0 = 4\) (“initial language”) |
Strength and location of the bias
The initial internal representation of the language (at \(t = 0\)) is a Beta distribution with shape parameters \(\alpha\) and \(\beta\). The Beta distribution is notoriously flexible, and can represent anything from (almost) flat (or uninformative) distributions to extremely peaked and “U”-shaped ones. For unimodal cases, we can model:
- \(\mu_{0}\): the location of the bias (the mode), i.e., the “preferred” value; the higher \(\mu_{0}\), the more likely the agent is to produce utterances “1”.
- \(\lambda_{0}\): the spread of the peak (the strength of the bias), i.e., how “strong” this preference is; operationally, how much data is needed to change the preferred value. The higher \(\lambda_{0}\), the less strongly biased the agent.
We describe the Beta distribution using these alternative parameters, the mode \(\mu_{0}\) and the “spread” \(\lambda_{0}\), which are linked to the shape parameters \(\alpha\) and \(\beta\) by a small algorithm:
- the user selects the mode of the Beta distribution (= the location of the bias) and the learning acceptance (= how strong the bias is);
- the program computes the lower and upper uncertainty limits from the given mode and learning acceptance, such that these limits are within the interval [0, 1];
- the values for the mode and the upper and lower limits are passed to the betaExpert function from the R package prevalence, which computes the unique values of \(\alpha\) and \(\beta\); for optimization and future-proofing reasons, we precomputed and hard-coded the \(\alpha\) and \(\beta\) values used in this paper within our NetLogo script (available in the GitHub repository mathjoss/bayes-in-network).
To run faster simulations, and to prevent compatibility problems in NetLogo, we saved these \(\alpha\) and \(\beta\) values directly inside the NetLogo code. For example, the algorithm produces the following \(\alpha\) and \(\beta\) values for given \(\mu_{0}\) and \(\lambda_{0}\):
| \((\mu_{0}, \lambda_{0})\) | \((\alpha, \beta)\) | Condition |
| --- | --- | --- |
| (0.1, 0.1) | (10.96, 90.62) | “biased rigid” |
| (0.1, 0.6) | (1.58, 6.19) | “biased flexible” |
| (0.5, 0.9) | (2.2, 2.2) | “unbiased” |
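Since the mode of a unimodal Beta distribution is \((\alpha-1)/(\alpha+\beta-2)\), one can quickly check that the tabulated \((\alpha, \beta)\) pairs reproduce the intended locations \(\mu_{0}\). A small Python check (ours, purely illustrative):

```python
def beta_mode(alpha, beta):
    """Mode of a unimodal Beta distribution (requires alpha, beta > 1)."""
    return (alpha - 1) / (alpha + beta - 2)

# (alpha, beta) values as hard-coded in the NetLogo script, per the table above
priors = {
    "biased rigid":    (10.96, 90.62),  # mu0 = 0.1, lambda0 = 0.1
    "biased flexible": (1.58, 6.19),    # mu0 = 0.1, lambda0 = 0.6
    "unbiased":        (2.20, 2.20),    # mu0 = 0.5, lambda0 = 0.9
}
for label, (a, b) in priors.items():
    print(label, round(beta_mode(a, b), 2))
```

All three modes come out at the stated \(\mu_{0}\) values (0.1, 0.1, and 0.5).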
They can be visually represented like this:
Hearing utterances will gradually change the internal representation of the language for each agent, whatever their starting distribution.
For example, after hearing 10 and 20 utterances “1”, the internal representations of the language look as follows:
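The corresponding shifts of the posterior mode can be sketched numerically (a Python illustration of the conjugate update above, assuming all heard utterances are “1”; not the NetLogo code):

```python
def posterior_mode(alpha0, beta0, k, n):
    """Mode of the posterior Beta after hearing k "1"s among n utterances."""
    a, b = alpha0 + k, beta0 + (n - k)
    return (a - 1) / (a + b - 2)

# A rigid bias (10.96, 90.62) resists the data,
# while a flexible bias (1.58, 6.19) yields to it much faster.
for k in (10, 20):
    rigid = posterior_mode(10.96, 90.62, k, k)      # all k utterances are "1"
    flexible = posterior_mode(1.58, 6.19, k, k)
    print(k, round(rigid, 2), round(flexible, 2))
```

After 20 utterances “1”, the rigid agent’s mode has only moved from 0.1 to about 0.25, while the flexible agent’s mode is already around 0.8.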
Initial language
The initial language parameter corresponds to two situations:
- On the one hand, it can model the (quite unrealistic) case where agents are born in a society without any pre-existing language, or where they are not exposed to any linguistic input (\(k_0 = 0\), \(n_0 = 0\)), so that the agents must create their first utterances based only on their prior bias (init_lang = 0, or init_lang = no in some plots).
- On the other hand, it can model the more common case where agents are born in a society with a pre-existing language already biased towards the use of the feature (\(k_0 = 4\), \(n_0 = 4\)); this is modelled by presenting all the agents with the same 4 utterances “1” at the initial iteration, so that the first utterances generated by the agents are based both on their prior bias and on the linguistic input from the society. In this analysis, the variant favoured by the biased agents (whether strongly or weakly biased) is always the utterance “0” (init_lang = 4, or init_lang = yes in some plots).
Here is a visualization of the Beta distribution curve of biased and unbiased agents, for both conditions of the initial language of the society:
As the condition with an initial language in the society is more realistic, we use it preferentially in the computations.
Learners
There are two widely-used strategies to produce utterances (among the many possible ones; Kirby et al., 2007):
- sampler strategy (SAM): a language \(h\) is sampled at random from the universe of possible languages, proportionally to its probability under the posterior distribution \(p(h|d)\);
- maximum a posteriori strategy (MAP): the agent picks the language \(h_M\) that has the maximum posterior probability, \(\max_{h \in U} p(h|d)\).
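The two strategies can be sketched in a few lines of Python (an illustration under our Beta posterior, not the NetLogo implementation; the function name is ours):

```python
import random

def produce_utterance(alpha, beta, learner="SAM", rng=random):
    """Produce one utterance (0 or 1) from an agent's Beta(alpha, beta) posterior."""
    if learner == "SAM":
        h = rng.betavariate(alpha, beta)       # sample h from the posterior
    else:
        h = (alpha - 1) / (alpha + beta - 2)   # MAP: mode of the posterior
    return 1 if rng.random() < h else 0        # Bernoulli(h) "coin throw"
```

Note that both strategies end with the same Bernoulli coin throw; they differ only in how the parameter \(h\) is extracted from the posterior.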
Time
The network we used is synchronous: the language values of all agents are updated simultaneously at the end of each iteration, after all agents have spoken once. More precisely, in a given iteration, each agent is selected in turn, in a random order, and produces one utterance (“speaks”), which is “heard” by all its network neighbours. However, the agents do not update their internal representation of language until all have “spoken” (i.e., until the end of the iteration). In this way, each agent gets the chance to “speak”, and does so using its representation of language from the previous iteration (if any), unaffected by any utterance it might have “heard” during the current iteration.
In this analysis, we call each such round a tick. We study the evolution of language over a period of 1,000 ticks for the main analysis, and 500 ticks for the systematic bias effect study.
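The synchronous round can be sketched as follows (a simplified Python illustration of the scheme described above, using MAP production for brevity; the actual model is in NetLogo):

```python
import random

def tick(posteriors, neighbours, rng=random):
    """One synchronous iteration: each agent speaks once, in random order, using
    its posterior from the previous tick; all updates happen at the round's end."""
    heard = {i: [] for i in posteriors}            # utterances buffered per hearer
    order = list(posteriors)
    rng.shuffle(order)
    for speaker in order:
        a, b = posteriors[speaker]
        h = (a - 1) / (a + b - 2)                  # MAP production, for simplicity
        u = 1 if rng.random() < h else 0
        for hearer in neighbours[speaker]:
            heard[hearer].append(u)
    return {i: (posteriors[i][0] + sum(us),        # alpha += number of "1"s heard
                posteriors[i][1] + len(us) - sum(us))  # beta += number of "0"s
            for i, us in heard.items()}
```

Buffering the heard utterances and applying all Bayesian updates at once is what makes the scheme synchronous: the order in which agents speak within a tick cannot influence what they say during that same tick.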
Network type
The network represents the socio-linguistic structure of a community and constrains the linguistic interactions between agents. The agents are the network’s nodes, and if there is an edge between two agents, then those two agents engage in linguistic interactions.
Please note that we consider here only static networks: there is no change, during a run, in the number of agents, in the properties of the agents (bias and utterance production mechanism), or in the topology of the network (i.e., the pattern of edges connecting the agents).
Likewise, our model includes neither directed nor weighted edges (i.e., connected agents interact symmetrically, and there is no way to specify that two agents might interact “more” than others), but we do think that dynamic, weighted, directed networks are an important avenue to explore in the future.
In this analysis, we use 3 different types of networks, always generated randomly in NetLogo: scale-free, random, and small-world networks.
Scale-free networks
Algorithm
We use the preferential attachment algorithm (Barabási, Albert, & Jeong, 2000). It starts from a seed of agents and gradually adds new ones; links are created between each newly-added agent and the pre-existing agents following the rule that the more connected an agent is, the greater its chance of receiving new connections. Formally, the probability \(p_i\) that a new agent is connected to agent \(i\) is \(p_i= \frac{k_i}{\sum_{j}k_j}\), where \(k_i\) is the degree of agent \(i\), and the sum runs over all pre-existing agents \(j\).
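A compact Python sketch of preferential attachment (the NetLogo generator differs in its details; this is an illustration of the rule above, where \(p_i \propto k_i\) is implemented by sampling from a pool in which each node appears once per edge it has):

```python
import random

def preferential_attachment(n, m, rng=random):
    """Grow a scale-free network of n nodes: each new node attaches to m
    distinct existing nodes, chosen proportionally to their current degree.
    Assumes m >= 2 so that the seed clique contains at least one edge."""
    edges = [(i, j) for i in range(m) for j in range(i + 1, m)]  # seed clique
    pool = [v for e in edges for v in e]   # node v appears deg(v) times
    for new in range(m, n):
        targets = set()
        while len(targets) < m:            # degree-proportional sampling
            targets.add(rng.choice(pool))
        for t in targets:
            edges.append((new, t))
            pool += [new, t]               # both endpoints gain one degree
    return edges
```

The final edge count is deterministic: \(m(m-1)/2\) seed edges plus \(m\) edges per added node.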
Characteristics
Scale-free networks exhibit a power-law degree distribution: very few agents have many connections, while many agents have a limited number of links. This type of network is found, for example, on the Internet (Albert, Jeong, & Barabási, 1999) or in cell biology (Albert, 2005).
Small-world networks
Algorithm
We use the classic beta model of the Watts-Strogatz algorithm (Watts & Strogatz, 1998). The algorithm first creates a ring of agents, where each agent is connected to \(N\) neighbours on either side, and then each edge is rewired with a chosen probability \(p\).
In this model, we always use the values \(N = 4\) and \(p = 0.1\).
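A Python sketch of the beta model (illustrative only; the NetLogo generator may differ in how rewired duplicates and self-loops are handled):

```python
import random

def watts_strogatz(n, N=4, p=0.1, rng=random):
    """Beta model: ring lattice with N neighbours on each side of every node,
    then each edge is rewired with probability p to a random new endpoint."""
    ring = {(min(i, j), max(i, j))
            for i in range(n) for d in range(1, N + 1) for j in [(i + d) % n]}
    rewired = set()
    for (a, b) in ring:
        if rng.random() < p:
            b = rng.randrange(n)           # pick a new endpoint at random
        if a != b:                         # drop self-loops (and duplicates, via the set)
            rewired.add((min(a, b), max(a, b)))
    return rewired
```

With \(p = 0\) this is exactly the ring lattice (\(nN\) edges for \(n > 2N\)); as \(p\) grows, shortcuts appear and average path lengths drop sharply while clustering stays high, which is the small-world regime.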
Characteristics
This process leads to the creation of hubs and the emergence of short average path lengths. Small-world properties were popularized by Milgram’s (1967) “six degrees of separation” idea, and are found in many real-world phenomena, such as social influence networks (Kitsak et al., 2010) and semantic networks (Kenett et al., 2018).
Random networks
Algorithm
We use Erdős and Rényi’s popular algorithm (Erdős & Rényi, 1959). We specify the number of agents and the overall connectivity of the graph by giving the probability \(p\) of adding an edge between any two agents; in this model, we always use \(p = 0.1\).
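The \(G(n, p)\) construction is a one-liner in Python (again an illustration, not the NetLogo code):

```python
import itertools
import random

def erdos_renyi(n, p, rng=random):
    """G(n, p): each of the n(n-1)/2 possible edges is included with probability p."""
    return [(i, j) for i, j in itertools.combinations(range(n), 2)
            if rng.random() < p]
```

The expected number of edges is simply \(p \cdot n(n-1)/2\).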
Characteristics
It is an unrealistic baseline model, which does not represent the structure of real-world networks.
Visualization

Influencers biased
We are interested in what happens when the most influential agents in a network are biased. To investigate this, we created the variable influencers_biased: the percentage of highest-centrality agents that are biased.
As an example, if 20% of the agents in the network are biased and influencers_biased is 15%, then the 15% most influential agents are biased, and a further 5% of the rest of the network is biased at random. Practically speaking:
- if prop_biased \(\geq\) influencers_biased, the influencers_biased most popular agents are biased, the rest of the biased agents being chosen at random in the population;
- if prop_biased \(<\) influencers_biased, then the prop_biased most popular agents are biased.
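These two rules can be sketched as a small selection function (a Python illustration with a hypothetical name; in the model, centrality is eigenvector centrality computed by NetLogo):

```python
import random

def select_biased(centrality, prop_biased, influencers_biased, rng=random):
    """Pick the set of biased agents: the top-centrality agents first (up to
    influencers_biased of the population), then random agents until
    prop_biased of the population is biased overall."""
    n = len(centrality)
    n_biased = round(prop_biased * n)
    n_influencers = min(round(influencers_biased * n), n_biased)  # rule 2: capped
    ranked = sorted(centrality, key=centrality.get, reverse=True)
    biased = set(ranked[:n_influencers])                          # most central first
    biased |= set(rng.sample(ranked[n_influencers:], n_biased - n_influencers))
    return biased
```

For example, with 10 agents, prop_biased = 50% and influencers_biased = 20%, the two most central agents are biased and three more are drawn at random.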
In the main analysis, we selected only 2 values for influencers_biased:
- 0% of influencers biased, so the biased agents in the population are selected randomly;
- 10% of influencers biased, so the 10% most influential agents are biased (if prop_biased \(\geq\) 10; otherwise, the prop_biased most influential agents are biased).
In the systematic bias effect study, we selected 3 values for influencers_biased:
- 0% of influencers biased, so the biased agents in the population are selected randomly;
- 50% of influencers biased, so the 50% most influential agents are biased (if prop_biased \(\geq\) 50; otherwise, the prop_biased most influential agents are biased);
- 100% of influencers biased, so the 100% most influential agents are biased (if prop_biased = 100; otherwise, the prop_biased most influential agents are biased).
Here, the most influential agents are determined using eigenvector centrality.
Dependent variables
We measured different variables using the NetLogo BehaviorSpace tool.
Here is the exhaustive list of all variables that we used in this analysis:
Language value
The language value of an agent at a given moment varies between 0 and 1, and is the mode of the Beta distribution representing the agent’s internal belief concerning the probability of utterances “1” in the language. Biased agents typically start with a lower language value than the unbiased agents, thus favoring the variant “0”. We also define the language value of a given group of agents (for example, a community or the whole network) as the mean of the language values of all the agents in the group. We decided to focus on the language value observed after 1,000 iterations, because the language value had always stabilized by this point.
Inter-individual variation across the agents in a given network is an important outcome: we found that most biased and unbiased agents have very similar behaviors within their respective groups, justifying the use of the mean language values of the biased (langval_biased) and the unbiased agents (langval_control). We also computed the mean language value of the whole population (langval_all): even if there may be variation between groups (the biased vs the unbiased agents) and between agents, this value is a global indicator of the average language used in the population.
To summarize, we use the 3 following variables:
- mean language value of all agents (langval_all) at final tick
- mean language value of biased agents (langval_biased) at final tick
- mean language value of unbiased agents (langval_control) at final tick
These variables were recorded directly inside NetLogo’s BehaviorSpace.
Difference between unbiased and biased agents
Here, we used the signed difference between the mean language values of the unbiased agents and the mean language values of biased agents, as this gives very similar results to the much more computationally expensive method of computing all pairwise differences between all unbiased and biased agents:
- difference of mean language value of unbiased agents and mean language value of biased agents (diff_group)
We computed this variable from the previous language value means, in R.
Stabilization time
Intuitively, stabilization time captures how long (in terms of interaction cycles) it takes for the language of a given network to reach a stable state. Given the inhomogeneous nature of the network, we consider three measures: the moment when the language value of the whole population stabilizes (stab_all), the moment when the language value of the biased agents stabilizes (stab_biased), and the moment when the language value of the unbiased agents stabilizes (stab_control); these measures are estimated using the language values of their respective populations. To summarize, we use:
- moment when the language value of the society stabilizes (stab_all)
- moment when the language value of the biased agents stabilizes (stab_biased)
- moment when the language value of the unbiased agents stabilizes (stab_control)
Please note that the measure stab_all is the least accurate and the least representative of the actual behaviour of the agents. Consequently, we decided to mainly study the results of stab_biased and stab_control.
These variables were computed from the mean language values above, in R. We used 2 different algorithms to compute the stabilization time (method 1 and method 2). After checking the results, we found method 2 to be more accurate, so this Rmarkdown only shows the results of the analysis using method 2.
Method 1
We used a discrete sliding window in which we estimated the derivative (i.e., the change) and recorded it. After the window had slid along the whole period of time, we selected the 15 values closest to 0; the value we chose as the stabilization time was the earliest among these 15 values.
This method is based on the method used by Jannsen (2018, p. 79). Practically speaking:
The maximum number of ticks of our model is \(nIterations = 1000\), and the size of the sliding window is \(w = nIterations/10\). We applied a loess function (a local polynomial regression fit) to the language values in each window. Then, we computed the following quantity on the predicted points (the regression line):
\[t(e_g) = \frac{e_{g+w} - e_g}{w}\]
and we obtain a sequence of slope estimates \(\vec{e}= (t(e_1), t(e_2), ...)\). The window stops at the end of the series, so \(length(\vec{e}) = nIterations - w\). Then, we selected the 15 values closest to 0 in \(\vec{e}\), and among these, the value \(t(e_g)\) with the minimum \(g\).
Method 2
The estimation is based on the method developed by Jannsen (2018, p. 79) and uses a fixed-size sliding window within which we estimate the change in the language value; we multiply this number by 100, round it, and stop if this number is equal to zero (i.e., the slope is within \(\pm\) 0.001 of 0.0) for 50 consecutive steps. Practically speaking, the maximum number of ticks of our model is \(nIterations = 1000\), and the size of the sliding window is \(w = nIterations/10\). For a given window, we estimated the change \(t(e_{g})\) using the following formula:
\[t(e_g) = \frac{e_{g+w} - e_g}{w}*100\]
On the rounded \(t(e_{g})\) values, we find the first index \(g_{stabilization}\) for which the rounded \(t(e_{g}) = 0\), and we stop if there is no change (\(t(e_{g}) = 0\)) for 50 consecutive steps (i.e., for all \(g \in [g_{stabilization} .. (g_{stabilization}+50)]\)); in this case, the stabilization time is the first moment with no change, namely \(g_{stabilization}\).
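Method 2 can be sketched in Python (an illustration of the sliding-window rule above, applied to a raw series of language values; the R implementation works on the loess-smoothed values):

```python
def stabilization_time(values, run_length=50):
    """First window index g at which round(100 * (e[g+w] - e[g]) / w) == 0
    holds for run_length consecutive windows; w is a tenth of the series length."""
    w = len(values) // 10
    slopes = [round((values[g + w] - values[g]) / w * 100)
              for g in range(len(values) - w)]
    run = 0
    for g, s in enumerate(slopes):
        run = run + 1 if s == 0 else 0
        if run == run_length:
            return g - run_length + 1      # first moment with no change
    return None                            # the series never stabilized

# A language value that rises for 75 ticks, then stays flat
series = [t / 75 if t < 75 else 1.0 for t in range(1000)]
print(stabilization_time(series))
```

On this synthetic series the detected stabilization point lags slightly behind tick 75 reaching flatness from the left, because the window of width 100 still straddles the ramp until the rounded slope drops to zero.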
Let us visualize the stabilization point found for an example.
Dissemination
First, the inter-replication variation is estimated by computing the standard deviation of the language values obtained across the R replications after 1,000 iterations. It captures the influence of various sources of randomness on each particular run of a given condition, and we computed it for 3 different groups:
- dissemination of the results of different replications for all agents (diss_all)
- dissemination of the results of different replications for biased agents only (diss_biased)
- dissemination of the results of different replications for unbiased agents only (diss_unbiased)
These variables were computed from the mean language values above, in R.
The results are recorded in a new table, data_dissemination, which gathers the dissemination values for each combination of conditions.
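The dissemination measure itself is a one-line computation; a Python sketch (assuming the sample standard deviation, as computed by R's sd(); the function name and example values are ours):

```python
from statistics import stdev

def dissemination(final_language_values):
    """Dissemination of one condition: the sample standard deviation of the
    final mean language values obtained across its replications."""
    return stdev(final_language_values)

# Four hypothetical replications of the same condition
print(round(dissemination([0.21, 0.24, 0.22, 0.25]), 3))
```

A small dissemination value indicates that the outcome of a condition is largely insensitive to the run-to-run randomness (network generation, speaking order, sampling).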